Dear Christine,


To estimate the number of populations in the data, we developed a Markov chain Monte Carlo sampler ("samplin", available through the Bitbucket 
repository: https://bitbucket.org/giovannidiana/samplin/src/master/) that generates samples from the posterior distribution of the model parameters introduced in the 
Methods section for which you requested the protocol. In particular, the algorithm uses the Dirichlet process prior to perform transitions in the number of possible 
classes of lineages. See Neal RM. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and 
Graphical Statistics. 2000;9(2):249-265 for the theoretical framework, and Diana, Giovanni, Thomas TJ Sainsbury, and Martin P. Meyer. 
"Bayesian inference of neuronal assemblies." PLoS Computational Biology 15.10 (2019) for an application of the same method to different data.
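
As an illustration of how a Dirichlet process prior leaves the number of classes open, the short Python sketch below draws a partition from the Chinese restaurant process, a standard sequential view of that prior. It is not taken from the "samplin" code; the function name, the concentration parameter "alpha" and the toy sizes are assumptions made only for this example.

```python
import random


def crp_assignments(n_items, alpha, rng):
    """Draw a partition of n_items from a Chinese restaurant process.

    Illustrative only: `alpha` (concentration) and this function are
    hypothetical and not part of the samplin software.
    """
    labels = []  # class label assigned to each item
    sizes = []   # current number of items in each class
    for _ in range(n_items):
        # An existing class k is chosen with weight sizes[k];
        # a brand-new class is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new class
        else:
            sizes[k] += 1     # join an existing class
        labels.append(k)
    return labels, sizes


rng = random.Random(0)
labels, sizes = crp_assignments(50, alpha=1.0, rng=rng)
print("number of classes in this draw:", len(sizes))
```

Larger values of "alpha" make new classes more likely, so the number of classes is itself random rather than fixed in advance.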


To reproduce our analysis you need a C++ compiler (standard on Linux or OSX systems). 


1. Clone the above repository locally and build the C++ software by running "make" in a terminal. 
2. Run the command "./bin/gibbs_data <iterations> <burn in> <trim> <maximum populations> <random seed> <data matrix> <output folder>", where: 
a. <iterations> is the number of MCMC steps. 
b. <burn in> is the number of samples excluded from the beginning of the Markov chain. 
c. <trim> is the number of samples discarded along the chain before a draw is accepted (usually done to reduce correlations among samples). 
d. <maximum populations> is the number of populations for which model parameters are recorded in the output files. 
e. <random seed> is the seed used to initialize the Markov chain. 
f. <data matrix> is the original data file, which should be formatted as a matrix where each row contains the layer occupancies of a given lineage. 
g. <output folder> is the folder where output files will be stored. In particular, the file "P.dat" contains draws from the posterior distribution of the number of populations. 
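
To make the roles of <burn in> and <trim> concrete, and to show one way of summarizing the draws in "P.dat", here is a minimal Python sketch. It rests on two assumptions that should be checked against the actual output: that <trim> counts the samples skipped between two kept draws, and that "P.dat" holds whitespace-separated integers. The function names and the toy chain are hypothetical and not part of "samplin".

```python
from collections import Counter


def thin_chain(chain, burn_in, trim):
    """Drop the first `burn_in` samples, then keep one sample and skip `trim`.

    Assumption: `trim` is the number of samples skipped between kept draws.
    """
    step = trim + 1
    return chain[burn_in::step]


def posterior_over_counts(draws):
    """Relative frequency of each number of populations among the draws."""
    counts = Counter(draws)
    total = len(draws)
    return {k: counts[k] / total for k in sorted(counts)}


def read_draws(path):
    """Read integer draws from a file.

    Assumption: the file contains whitespace-separated integers.
    """
    with open(path) as handle:
        return [int(token) for token in handle.read().split()]


# Toy chain standing in for the sampled number of populations.
toy_chain = [1, 1, 2, 2, 3, 2, 3, 3, 2, 3, 3, 4]
kept = thin_chain(toy_chain, burn_in=4, trim=1)
posterior = posterior_over_counts(kept)
print(kept)
print(posterior)
```

With real output, something like posterior_over_counts(read_draws("<output folder>/P.dat")) would give the estimated posterior probability of each number of populations, provided the file layout matches the assumption above.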


This method was specifically designed to analyze cortical layer occupancy data of lineages using a specific model of the data. It might not be 
suitable for other datasets, which may well contain population structure but are not well described by the specific model we used. The general 
algorithm implemented in "samplin" can still be used, provided that a different model is implemented. For specific information on how to introduce a different 
model into the algorithm, please contact Giovanni Diana by email at g.diana.mail@gmail.com. 


Best regards, 


Giovanni Diana 